The data files are from two mobility companies. We are interested in learning more about the dynamics of the market in which they operate.
import warnings
warnings.filterwarnings("ignore")
import glob
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
files = glob.glob("*.csv")
print("There are {} data files (training data as well as testing data)".format(len(files)))
files
company_a_1 = pd.read_csv("company_a_1.csv")
company_a_1.head()
company_a_2 = pd.read_csv("company_a_2.csv")
company_a_2.head()
company_a_3 = pd.read_csv("company_a_3.csv")
company_a_3.head()
company_a_4 = pd.read_csv("company_a_4.csv")
company_a_4.head()
company_a_5 = pd.read_csv("company_a_5.csv")
company_a_5.head()
company_b_1 = pd.read_csv("company_b_1.csv")
company_b_1.head()
company_b_2 = pd.read_csv("company_b_2.csv")
company_b_2.head()
Distances in test data is in meters, while training it is Kilometers, so convert distance to Kilometer first.
import geopy.distance
coords_1 = (38.23, -85.75)
coords_2 = (38.25, -85.76)
print(geopy.distance.geodesic(coords_1, coords_2).km)
training_data = pd.concat([company_a_1, company_a_2, company_a_3, company_a_4, company_a_5])
training_data.shape
testing_data = pd.concat([company_b_1, company_b_2])
testing_data.shape
training_data["StartTime"] = pd.to_datetime(training_data["StartTime"])
training_data["EndTime"] = pd.to_datetime(training_data["EndTime"])
testing_data["start_time"] = pd.to_datetime(testing_data["start_time"])
testing_data["completed_time"] = pd.to_datetime(testing_data["completed_time"])
Also calculate trip duration using formula
$$tripDuration = time_{start} - time_{end}$$training_data["year"] = training_data["StartTime"].dt.year
testing_data["year"] = testing_data["start_time"].dt.year
training_data["month"] = training_data["StartTime"].dt.month_name()
testing_data["month"] = testing_data["start_time"].dt.month_name()
training_data["day"] = training_data["StartTime"].dt.day_name()
testing_data["day"] = testing_data["start_time"].dt.day_name()
training_data["hour"] = training_data["StartTime"].dt.hour
testing_data["hour"] = testing_data["start_time"].dt.hour
training_data["tripDuration(minutes)"] = round((training_data["EndTime"] - training_data["StartTime"]).dt.total_seconds()/60)
testing_data["tripDuration(minutes)"] = round((testing_data["completed_time"] - testing_data["start_time"]).dt.total_seconds()/60)
training_data.head()
training_data.nunique()
testing_data.nunique()
training_data.isna().sum()
testing_data.isna().sum()
training_data.columns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import numpy as np
def plotOutliers(data, c):
print("Feature Name:", c)
figure(figsize=(18, 8), dpi=80)
plt.plot([i for i in range(len(data[c].values))], data[c].values)
df = np.array(data[c])
print("Highest allowed",np.mean(df) + 3* np.std(df))
print("Lowest allowed",np.mean(df) - 3* np.std(df))
H = np.mean(df) + 3* np.std(df)
L = np.mean(df) - 3* np.std(df)
values = df[(df > H) | (df < L)]
if len(values) < 10:
print(values)
R = []
for i in data[c].values:
if i > H or i < L:
R.append(i)
else:
R.append(data[c].mean())
plt.plot([i for i in range(len(data[c].values))], R, linestyle='None', marker="*", markersize=15, label="Outliers")
plt.title(c)
plt.legend()
plt.show()
c = 'StartLatitude'
plotOutliers(training_data, c)
c = 'StartLongitude'
plotOutliers(training_data, c)
c = 'EndLatitude'
plotOutliers(training_data, c)
There are some invalid locations i.e locations outside city.
invalid_locations = [-122.42, -111.87, -111.87, -111.87, 26.12]
training_data = training_data[~training_data["EndLatitude"].isin(invalid_locations)]
training_data = training_data[training_data["EndLatitude"] != 0]
c = 'EndLongitude'
plotOutliers(training_data, c)
training_data = training_data[~training_data["EndLongitude"].isna()]
training_data = training_data[~training_data["EndLongitude"].isin(invalid_locations)]
training_data = training_data[training_data["EndLongitude"] != 0]
c = 'TripDistance'
plotOutliers(training_data, c)
invalid_values = [1111734.31, 1334335.28, 1334377.66, 1213527.47, 1334328.7, 822605.85, 1327569.19]
training_data = training_data[~training_data["TripDistance"].isin(invalid_values)]
training_data = training_data[training_data["TripDistance"] > 0]
c = 'tripDuration(minutes)'
plotOutliers(training_data, c)
invalid_values = [3167, 2843]
training_data = training_data[~training_data["tripDuration(minutes)"].isin(invalid_values)]
training_data = training_data[training_data["tripDuration(minutes)"] > 0]
c = 'TripDistance'
plotOutliers(training_data, c)
training_data = training_data[(training_data["TripDistance"] < 2000) & (training_data["TripDistance"] > 0)]
c = 'start_latitude'
plotOutliers(testing_data, c)
c = 'start_longitude'
plotOutliers(testing_data, c)
c = 'end_latitude'
plotOutliers(testing_data, c)
The regions shown in red circles are the outliers, because usually this is not possible to have cross-state rides or very far rides.
invalid_values = [26.12, 19.42]
testing_data = testing_data[~testing_data["end_latitude"].isin(invalid_values)]
testing_data = testing_data[testing_data["end_latitude"] != 0]
c = 'end_longitude'
plotOutliers(testing_data, c)
invalid_values = [-111.94, 2.36, 2.32, 2.33, -121.89, -119.81]
testing_data = testing_data[~testing_data["end_longitude"].isin(invalid_values)]
testing_data = testing_data[testing_data["end_longitude"] != 0]
c = 'tripDuration(minutes)'
plotOutliers(testing_data, c)
testing_data = testing_data[(testing_data["tripDuration(minutes)"] < 400) & (testing_data["tripDuration(minutes)"] > 0)]
c = 'distance_meters'
plotOutliers(testing_data, c)
testing_data = testing_data[(testing_data["distance_meters"] < 50000) & (testing_data["distance_meters"] > 0)]
training_data.to_csv("company_A.csv", index=False)
testing_data.to_csv("company_B.csv", index=False)
Note: Figure 1 Generated using Power BI - Left (Starting Points of Journey) - Right (Ending Points of Journey)
Note: Figure 2 Generated using Power BI - Left (Starting Points of Journey) - Right (Ending Points of Journey)
Note: This external data can be used for deep and detailed analysis.
https://censusreporter.org/profiles/05000US21111-jefferson-county-ky/
The population of mentioned state is good enough to target for company.
18 to 64 years. They have high chances to use scooter. Rest are either children or elders which seldom use scooters. Hence the use of scooter reduced from 766000 to 473000 (if all people use scooter)473000 to 238000 (which is 50% of 473000)def showHist(columnName1, columnName2):
fig, a = plt.subplots(1, 2, figsize=(20, 5))
try:
a[0].hist(training_data[columnName1])
a[1].hist(testing_data[columnName2])
except:
training_data[columnName1].value_counts().head(20).plot(kind='barh', ax=a[0]).sort_index()
testing_data[columnName2].value_counts().head(20).plot(kind='barh', ax=a[1]).sort_index()
a[0].set_title(columnName1 + ' - Company A')
a[1].set_title(columnName2 + ' - Company B')
fig.tight_layout()
plt.subplots_adjust(left=0.125,
bottom=0.1,
right=0.9,
top=0.9,
wspace=0.4,
hspace=0.35)
plt.show()
showHist("StartLatitude", "start_latitude")
showHist("StartLongitude", "start_longitude")
showHist("EndLatitude", "end_latitude")
showHist("EndLongitude", "end_longitude")
training_data["distance_meters"] = training_data["TripDistance"] * 1000
testing_data["distance_meters"].min()
showHist("distance_meters", "distance_meters")
showHist("year", "year")
showHist("day", "day")
showHist("hour", "hour")
showHist("month", "month")
showHist("tripDuration(minutes)", "tripDuration(minutes)")
By looking at above histograms, I have concluded following
print("Look at the Year Wise Data - Company A")
print(training_data["year"].value_counts())
print("\n\nLook at the Year Wise Data - Company B")
print(testing_data["year"].value_counts())
print('For analysis we need to get data of same year as Company A')
testing_data = testing_data[testing_data["year"] == 2018]
print("Take December Data from both Datasets")
training_data = training_data[training_data["month"] == "December"]
testing_data = testing_data[testing_data["month"] == "December"]
print('There are {} records in Company A'.format(training_data.shape[0]))
print('There are {} records in Company B'.format(testing_data.shape[0]))
training_data.nunique()
scoters_trips = training_data[["ScooterID", "TripID"]].groupby(["ScooterID"]).count().reset_index()
scoters_trips = scoters_trips.sort_values(by="TripID", ascending=False)
scoters_trips
print("There are total {} scooters of company A".format(scoters_trips["ScooterID"].nunique()))
average_trips_per_scooter = round(training_data["TripID"].nunique() / scoters_trips["ScooterID"].nunique())
print("On Average {} trips are done by single scooter".format(average_trips_per_scooter))
month_trips_a = training_data[["month", "TripID", "ScooterID"]].groupby(["month"]).nunique().reset_index()
month_trips_a = month_trips_a.sort_values(by="TripID", ascending=False)
print("Monthly Trips")
print("On Average company A have {} trips".format(round(month_trips_a["TripID"].mean())))
print("On Average company A have {} active scooters".format(round(month_trips_a["ScooterID"].mean())))
day_trips_a = training_data[["month","day","TripID", "ScooterID"]].groupby(["month","day"]).nunique().reset_index()
day_trips_a = day_trips_a.sort_values(by="TripID", ascending=False)
print("Daily Trips")
print("On Average company A have {} trips".format(round(day_trips_a["TripID"].mean())))
print("On Average company A have {} active scooters".format(round(day_trips_a["ScooterID"].mean())))
hourly_trips_a = training_data[["month","day", "hour","TripID", "ScooterID"]].groupby(["month","day","hour"]).nunique().reset_index()
hourly_trips_a = hourly_trips_a.sort_values(by="TripID", ascending=False)
print("Hourly Trips")
print("On Average company A have {} trips".format(round(hourly_trips_a["TripID"].mean())))
print("On Average company A have {} active scooters".format(round(hourly_trips_a["ScooterID"].mean())))
month_trips_b = testing_data[["month", "trip_id"]].groupby(["month"]).nunique().reset_index()
month_trips_b = month_trips_b.sort_values(by="trip_id", ascending=False)
print("Monthly Trips")
print("On Average company B have {} trips".format(round(month_trips_b["trip_id"].mean())))
day_trips_b = testing_data[["month","day", "trip_id"]].groupby(["month","day"]).nunique().reset_index()
print("Daily Trips")
print("On Average company B have {} trips".format(round(day_trips_b["trip_id"].mean())))
hourly_trips_b = testing_data[["month","day", "hour","trip_id"]].groupby(["month","day","hour"]).nunique().reset_index()
hourly_trips_b = hourly_trips_b.sort_values(by="trip_id", ascending=False)
print("Hourly Trips")
print("On Average company B have {} trips".format(round(hourly_trips_b["trip_id"].mean())))
I am calculating different features for Company A and same features for Company B. Then finding ration of feature of company B to A using formula;
$$ratio_{feature_i} = \frac {Feature_{companyB}} {Feature_{companyA}}$$Then for calculating average of all ratios using
$$values = \frac {ratio_i} N$$Where N is the number of ratios.
The final comparative constant is calculated using weighted average; i.e
$$constant = \sum_{i=1}^N {values_i * weight_i}$$I have kept following weights;
0.3 for all averages_values (avg_of_avg)0.2 for all maximum_values (avg_of_maximum)0.2 for all minimum_values (avg_of_minimum)0.3 for all sum values (avg_of_sum)Note: These weight values are calculated on the basis of optimization i-e hit and trial method used to calculate optimal value
print("Company A - Calculations")
print("-------------------------")
a_trips = training_data["TripID"].nunique()
a_trip_distance_minimum = training_data["distance_meters"].min()
a_trip_distance_maximum = training_data["distance_meters"].max()
a_trip_distance_average = training_data["distance_meters"].mean()
a_trip_distance_sum = training_data["distance_meters"].sum()
a_trip_duration_minimum = training_data["tripDuration(minutes)"].min()
a_trip_duration_maximum = training_data["tripDuration(minutes)"].max()
a_trip_duration_average = training_data["tripDuration(minutes)"].mean()
a_trip_duration_sum = training_data["tripDuration(minutes)"].sum()
a_average_monthly_trip = round(month_trips_a["TripID"].mean())
a_minimum_monthly_trip = round(month_trips_a["TripID"].min())
a_maximum_monthly_trip = round(month_trips_a["TripID"].max())
a_average_daily_trip = round(day_trips_a["TripID"].mean())
a_minimum_daily_trip = round(day_trips_a["TripID"].min())
a_maximum_daily_trip = round(day_trips_a["TripID"].max())
a_average_hourly_trip = round(hourly_trips_a["TripID"].mean())
a_minimum_hourly_trip = round(hourly_trips_a["TripID"].min())
a_maximum_hourly_trip = round(hourly_trips_a["TripID"].max())
print("Company B - Calculations")
print("-------------------------")
b_trips = testing_data["trip_id"].nunique()
b_trip_distance_minimum = testing_data["distance_meters"].min()
b_trip_distance_maximum = testing_data["distance_meters"].max()
b_trip_distance_average = testing_data["distance_meters"].mean()
b_trip_distance_sum = testing_data["distance_meters"].sum()
b_trip_duration_minimum = testing_data["tripDuration(minutes)"].min()
b_trip_duration_maximum = testing_data["tripDuration(minutes)"].max()
b_trip_duration_average = testing_data["tripDuration(minutes)"].mean()
b_trip_duration_sum = testing_data["tripDuration(minutes)"].sum()
b_average_monthly_trip = round(month_trips_b["trip_id"].mean())
b_minimum_monthly_trip = round(month_trips_b["trip_id"].min())
b_maximum_monthly_trip = round(month_trips_b["trip_id"].max())
b_average_daily_trip = round(day_trips_b["trip_id"].mean())
b_minimum_daily_trip = round(day_trips_b["trip_id"].min())
b_maximum_daily_trip = round(day_trips_b["trip_id"].max())
b_average_hourly_trip = round(hourly_trips_b["trip_id"].mean())
b_minimum_hourly_trip = round(hourly_trips_b["trip_id"].min())
b_maximum_hourly_trip = round(hourly_trips_b["trip_id"].max())
b_a_trip_ratio = b_trips / a_trips
b_a_distance_min_ratio = b_trip_distance_minimum / a_trip_distance_minimum
b_a_distance_max_ratio = b_trip_distance_maximum / a_trip_distance_maximum
b_a_distance_avg_ratio = b_trip_distance_average / a_trip_distance_average
b_a_distance_sum_ratio = b_trip_distance_sum / a_trip_distance_sum
b_a_duration_min_ratio = b_trip_duration_minimum / a_trip_duration_minimum
b_a_duration_max_ratio = b_trip_duration_maximum / a_trip_duration_maximum
b_a_duration_avg_ratio = b_trip_duration_average / a_trip_duration_average
b_a_duration_sum_ratio = b_trip_duration_sum / a_trip_duration_sum
b_a_average_monthly_trip = b_average_monthly_trip / a_average_monthly_trip
b_a_minimum_monthly_trip = b_minimum_monthly_trip / a_minimum_monthly_trip
b_a_maximum_monthly_trip = b_maximum_monthly_trip / a_maximum_monthly_trip
b_a_average_daily_trip = b_average_daily_trip / a_average_daily_trip
b_a_minimum_daily_trip = b_minimum_daily_trip / a_minimum_daily_trip
b_a_maximum_daily_trip = b_maximum_daily_trip / a_maximum_daily_trip
b_a_average_hourly_trip = b_average_hourly_trip / a_average_hourly_trip
b_a_minimum_hourly_trip = b_minimum_hourly_trip / a_minimum_hourly_trip
b_a_maximum_hourly_trip = b_maximum_hourly_trip / a_maximum_hourly_trip
print("Calculate Ratios")
print("-----------------------------------------------")
print("Number of Trips = ", b_a_trip_ratio)
print("-----------------------------------------------")
print("Minimum Distance Ratio = ", b_a_distance_min_ratio)
print("Maximum Distance Ratio = ", b_a_distance_max_ratio)
print("Average Distance Ratio = ", b_a_distance_avg_ratio)
print("Sum Distance Ratio = ", b_a_distance_sum_ratio)
print("-----------------------------------------------")
print("Minimum Duration Ratio = ", b_a_duration_min_ratio)
print("Maximum Duration Ratio = ", b_a_duration_max_ratio)
print("Average Duration Ratio = ", b_a_duration_avg_ratio)
print("Sum Duration Ratio = ", b_a_duration_sum_ratio)
print("-----------------------------------------------")
print("Minimum Monthly Trips Ratio = ", b_a_average_monthly_trip)
print("Maximum Monthly Trips Ratio = ", b_a_minimum_monthly_trip)
print("Average Monthly Trips Ratio = ", b_a_maximum_monthly_trip)
print("-----------------------------------------------")
print("Minimum Daily Trips Ratio = ", b_a_average_daily_trip)
print("Maximum Daily Trips Ratio = ", b_a_minimum_daily_trip)
print("Average Daily Trips Ratio = ", b_a_maximum_daily_trip)
print("-----------------------------------------------")
print("Minimum Hourly Trips Ratio = ", b_a_average_hourly_trip)
print("Maximum Hourly Trips Ratio = ", b_a_minimum_hourly_trip)
print("Average Hourly Trips Ratio = ", b_a_maximum_hourly_trip)
avg_of_minimum = (b_a_distance_min_ratio + b_a_duration_min_ratio + b_a_minimum_monthly_trip + b_a_minimum_daily_trip + b_a_minimum_hourly_trip)/5
avg_of_maximum = (b_a_distance_max_ratio + b_a_duration_max_ratio + b_a_maximum_monthly_trip + b_a_maximum_daily_trip + b_a_maximum_hourly_trip)/5
avg_of_sum = (b_a_distance_sum_ratio + b_a_duration_sum_ratio) /2
avg_of_avg = (b_a_distance_avg_ratio + b_a_duration_avg_ratio + b_a_average_monthly_trip + b_a_average_daily_trip + b_a_average_hourly_trip)/5
avg_of_avg, avg_of_maximum, avg_of_minimum, avg_of_sum
overall_average = (avg_of_avg*0.3 + avg_of_maximum*0.2 + avg_of_minimum*0.2 + avg_of_sum*0.3)
overall_average
print("Estimated Number of Trips of Company B = ", round(training_data["TripID"].nunique()*overall_average))
print("Actual Number of Trips of Company B = ", testing_data["trip_id"].nunique())
print("Estimated Average Trip Duration of Company B = ", round(training_data["tripDuration(minutes)"].mean()*overall_average))
print("Actual Average Trip Duration of Company B = ", round(testing_data["tripDuration(minutes)"].mean()))
print("Estimated Average Distance Covered of Company B = ", round(training_data["distance_meters"].mean()*overall_average))
print("Actual Average Distance Covered of Company B = ", round(testing_data["distance_meters"].mean()))
round(training_data["ScooterID"].nunique()*overall_average)
trip_ratio = testing_data["trip_id"].nunique() / training_data["TripID"].nunique()
print('The Ratio of Trips of Company B to A = ', trip_ratio)
duration_ratio = testing_data["tripDuration(minutes)"].mean() / training_data["tripDuration(minutes)"].mean()
print('The Ratio of Average Duration of Company B to A = ', duration_ratio)
duration_ratio = testing_data["distance_meters"].mean() / training_data["distance_meters"].mean()
print('The Ratio of Average Distance Covered of Company B to A = ', duration_ratio)
The number of trips by Company A's scooters are almost double of company B's scooters, because Company B's scooters usually cover long distance and the trip duration is near to double of Company A. Therefor we can conclude that the number of scooters of company B is same region will less than company A, hence
260¶Figure 3: (generated using PowerBI) - Left (Starting Points of Journey) - Right (Ending Points of Journey)
Figure 4: (generated using PowerBI) - Left (Starting Points of Journey) - Right (Ending Points of Journey)
Both of above stratagies will help us to reduce response and turn around time (time required to move from source to destination and then back).